{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hdf5 File format\n",
    "Vaex uses [hdf5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) (Hierarchical Data Format) for storing data. You can think of hdf5 files as being a file system, where the 'files' contain N-dimensional arrays, or think of it as the binary equivalent of XML files. Being almost like a filesystem, you can store data anyway, for instance under '/mydata/somearray'. \n",
    "\n",
    "For vaex we based our layout on [VOTable](http://www.ivoa.net/documents/VOTable/), any recommendation, comments or requests to standardize are welcome.\n",
    "\n",
    "In vaex, every column is stored under /data, which can be found out using the h5ls tool\n",
    "```\n",
    "$ h5ls data/helmi-dezeeuw-2000-10p.hdf5 \n",
    "data                     Group\n",
    "```\n",
    "\n",
    "\n",
    "All columns are stored under this group, and can be listed:\n",
    "\n",
    "```\n",
    "$ h5ls data/helmi-dezeeuw-2000-10p.hdf5/data\n",
    "E                        Dataset {330000}\n",
    "FeH                      Dataset {330000}\n",
    "L                        Dataset {330000}\n",
    "Lz                       Dataset {330000}\n",
    "random_index             Dataset {330000}\n",
    "vx                       Dataset {330000}\n",
    "vy                       Dataset {330000}\n",
    "vz                       Dataset {330000}\n",
    "x                        Dataset {330000}\n",
    "y                        Dataset {330000}\n",
    "z                        Dataset {330000}\n",
    "```\n",
    "\n",
    "If you for some reason don't want to use vaex, but access the data using Python, you could do something like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('mean FeH', -1.6934730008384034, 'length', 330000)\n"
     ]
    }
   ],
   "source": [
    "import h5py\n",
    "import numpy as np\n",
    "h5file = h5py.File(\"/Users/users/breddels/src/vaex/data/helmi-dezeeuw-2000-10p.hdf5\", \"r\")\n",
    "FeH = h5file[\"/data/FeH\"]\n",
    "# FeH is your regular numpy array (with some extras)\n",
    "print(\"mean FeH\", np.mean(FeH), \"length\", len(FeH))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More information about a column can be found using:\n",
    "```\n",
    "h5ls -v data/helmi-dezeeuw-2000-10p.hdf5/data/FeH \n",
    "Opened \"data/helmi-dezeeuw-2000-10p.hdf5\" with sec2 driver.\n",
    "FeH                      Dataset {330000/330000}\n",
    "    Attribute: ucd scalar\n",
    "        Type:      variable-length null-terminated ASCII string\n",
    "        Data:  \"phys.abund.fe\"\n",
    "    Attribute: unit scalar\n",
    "        Type:      variable-length null-terminated ASCII string\n",
    "        Data:  \"dex\"\n",
    "    Location:  1:2644064\n",
    "    Links:     1\n",
    "    Storage:   2640000 logical bytes, 2640000 allocated bytes, 100.00% utilization\n",
    "    Type:      native double\n",
    "```\n",
    "Here we see that the (similar to VOTable), we have a [ucd attribute](http://www.ivoa.net/documents/latest/UCD.html) which describes what the column represents, and its units.\n",
    "\n",
    "These can be accessed using h5py as well"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('phys.abund.fe', 'dex')\n"
     ]
    }
   ],
   "source": [
    "print(FeH.attrs[\"ucd\"], FeH.attrs[\"unit\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Further restrictions are the first character of the column name should be an underscope (_) an ascii letter (a-z or A-Z), and following characters can also include a digit.\n",
    "\n",
    "For completeness, the layout is as follows\n",
    "\n",
    " * /data\n",
    "  * /column1 (with optional attribute ucd and unit)\n",
    "  * ...\n",
    "  * /columnN (with optional attribute ucd and unit)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
